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Abstract 


This  paper  studies  the  performance  implications  of  architectural  synchronization 
support  for  automatically  parallelized  numerical  programs.  As  the  basis  for  this  work, 
we  analyze  the  needs  for  synchronization  in  automatically  parallelized  numerical  pro¬ 
grams.  The  needs  are  due  to  task  management,  loop  scheduling,  barriers,  and  data 
dependency  handling.  We  present  synchronization  algorithms  for  efficient  execution 
of  programs  with  nested  parallel  loops.  Next,  we  identify  how  various  hardware  syn¬ 
chronization  primitives  can  be  used  to  satisfy  these  software  synchronization  needs. 
The  synchronization  primitives  studied  are  test&set,  fetch&add,  exchange-byte  and 
synchronization  bus  implementation  of  lock/unlock  operations.  Lastly,  we  ran  experi¬ 
ments  to  quantify  the  impact  of  various  architectural  support  on  the  performance  of  a 
bus-based  shared  memory  multiprocessor  running  automatically  parallelized  numerical 
programs.  We  found  that  supporting  an  atomic  fetch&add  primitive  in  shared  memory 
is  as  effective  as  supporting  lock/unlock  operations  with  a  synchronization  bus.  Both 
achieve  substantial  performance  improvement  over  the  cases  where  atomic  test&set  and 
exchange-byte  operations  are  supported  in  shared  memory. 


1  Introduction 

Automatically  parallelized  numerical  programs  represent  an  important  class  of  parallel  appli¬ 
cations  in  high-performance  multiprocessors.  These  programs  are  used  to  solve  problems  in 
many  engineering  and  science  disciplines  such  as  Civil  Engineering,  Mechanical  Engineering, 
Electrical  Engineering,  Chemistry,  Physics,  and  Life  Sciences.  In  response  to  the  popular 
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demand,  parallelizing  FORTRAN  compilers  have  been  developed  for  commercial  and  experi¬ 
mental  multiprocessor  systems  to  support  these  applications  [2]  [12]  [1]  [8]  [11].  With  maturing 
application  and  support  software,  the  time  has  come  to  stud}'  the  architecture  support  re¬ 
quired  to  achieve  high  performance  for  these  parallel  programs. 

Synchronization  overhead  has  been  recognized  as  an  important  source  of  performance 
degredation  in  the  execution  of  parallel  programs.  Many  hardware  and  software  tech¬ 
niques  have  been  proposed  to  reduce  the  synchronization  cost  in  multiprocessor  systems 
[13]  [20]  [19]  [3]  [14]  [15]  [16].  Instead  of  proposing  new  synchronization  techniques,  we  address 
a  simple  question  in  this  paper:  does  architecture  support  for  synchronization  substantially 
affect  the  performance  of  automatically  parallelized  numerical  programs? 

To  answer  this  question,  we  start  with  analyzing  the  needs  of  synchronization  in  par¬ 
allelized  FORTRAN  programs  in  Section  2.  Due  to  the  mechanical  nature  of  parallelizing 


compilers,  parallelism  is  expressed  in  only  a.  few  structured  forms.  This  parallel  program¬ 
ming  style  allows  us  to  systematically  cover  all  the  synchronization  needs  in  automatically 
parallelized  programs.  Synchronization  issues  arise  in  task  management,  loop  scheduling, 
barriers  and  data  dependence  handling.  A  set  of  algorithms  are  presented  which  use  generic 
lock()/unlock()  and  increment!)  operations.  We  then  identify  how  several  hardware  syn¬ 


chronization  primitives  can  be  used  to  implement  these  generic  synchronization  operations. 
These  synchronization  primitives  are  tesi&set ,  fetch&add ,  exchange-byte,  and  lock/unlock 
operations  .  Since  these  primitives  differ  in  functionality,  the  algorithms  for  synchronization 


in  parallel  programs  are  implemented  with  varying  efficiency. 
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Dlst 


Section  3  describes  the  experimental  procedure  and  the  scope  of  our  experiments.  In 
Section  4,  the  issue  of  loop  scheduling  overhead  is  addressed  in  the  context  of  hardware 
synchronization  support.  We  present  an  anlytical  model  for  the  effect  of  loop  scheduling 
overhead  and  loop  granularity  on  execution  oime.  Furthermore  we  measure  loop  scheduling 
overhead  for  different  synchronization  primitives  with  simulation. 

Synchronization  needs  of  a  parallel  application  depend  on  the  numerical  algorithms  and 
the  effectiveness  of  the  parallelization  process,  therefore  the  performance  implications  of  ar¬ 
chitectural  synchronization  support  can  only  be  quantified  with  experimentation.  Section  5 
addresses  the  issues  of  granularity  and  lock  locality  in  real  applications.  Using  programs 
selected  from  the  Perfect  Club  [6]  benchmark  set,  we  evaluate  the  impact  of  various  architec¬ 
tural  support  on  the  performance  of  a  bus- based  shared-memory  multiprocessor  architecture 
in  Section  6.  We  conclude  that  architectural  support  for  synchronization  has  a  profound 
impact  on  the  performance  of  the  benchmark  programs. 

Finally,  the  related  work  is  presented  in  Section  7  and  Section  8  includes  the  concluding 
remarks. 


2  Background 

In  this  section,  we  first  describe  how  parallelism  is  expressed  in  parallel  FORTRAN  pro¬ 
grams.  We  then  analyze  the  synchronization  needs  in  the  execution  of  these  programs. 
Most  importantly,  we  show  how  architectural  support  for  synchronization  can  afTect  their 
performance. 
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DOALL  30  J=1,J1 
X(II1+J)  =  X(II1+J)  *  SCI 
Y (II1+J)  =  Y(II1+J)  *  SCI 
ZClIl+J)  =  Z(II1+J)  *  SCI 
30  CONTINUE 

Figure  1:  A  DOALL  loop 

2.1  Parallel  FORTRAN  Programs 

The  application  programs  used  in  this  study  are  selected  from  the  Perfect  Club  benchmark  set 
[6].  The  Perfect  Club  is  a  collection  of  numerical  programs  for  benchmarking  supercomput¬ 
ers.  The  programs  were  written  in  FORTRAN.  For  our  experiments,  they  were  parallelized 
by  the  Cedar  source-to-source  parallelizer  [11]  which  generates  a  parallel  FORTRAN  dialect, 
Cedar  FORTRAN.  This  process  exploits  parallelism  at  the  loop  level  and  loop  level  paral¬ 
lelization  has  been  shown  to  capture  most  of  the  available  parallelism  for  these  programs  [7]. 
Cedar  FORTRAN  has  two  major  constructs  to  express  loop  level  parallelism:  DOALL  and 
DOACROSS  loops.  A  DOALL  loop  is  a  parallel  DO  loop  where  there  is  no  dependence  be¬ 
tween  the  iterations.  The  iterations  can  be  executed  in  parallel  in  arbitrary  order.  Figure  1 
shows  an  example  of  a  DOALL  loop. 

In  a  DOACROSS  loop  [9],  there  is  a  dependence  relation  across  the  iterations.  A 
DOACROSb  loop  has  the  restriction  that  iteration  i  can  only  depend  on  iterations  j  where 
j  <  i.  Because  of  this  property,  even  a  simple  loop  scheduling  scheme  can  guarantee  deadlock 
free  allocation  of  DOACROSS  loop  iterations  to  processors.  In  Cedar  FORTRAN,  depen¬ 
dences  between  loop  iterations  are  enforced  by  Advance/Await  synchronization  statements 
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[2].  An  example  of  a  DOACROSS  loop  is  shown  in  Figure  2.  The  first  argument  of  Advance 
and  Await  statements  is  the  name  of  the  synchronization  variable  to  be  used.  The  second 
argument  of  an  Await  statement  is  the  data  dependence  distance  in  terms  of  iterations.  In 
this  example,  when  iteration  i  is  executing  this  Await  statement,  it  is  waiting  for  iteration 
i  —  3  to  execute  its  Advance  statement.  The  third  argument  of  Await  is  used  to  enforce 
sequential  consistency  in  Cedar  architecture  [11].  The  third  argument  implies  that  upon 
the  completion  of  synchronization,  the  value  of  X(I-3)  should  be  read  from  shared  memory. 
Similarly,  the  second  argument  of  Advance  statement  implies  that  writing  the  value  X(I)  to 
shared  memory  should  be  completed  before  Advance  statement  is  executed. 

DOACROSS  40  1=4, IL 

AWAIT(1,  3,  X(I-3)) 

X(I)  =  Y(I)  +  X(I-3) 

ADVANCE  (1,  X(I)> 

30  CONTINUE 

Figure  2:  A  DOACROSS  loop 


2.2  Synchronization  Needs 

In  executing  parallel  FORTRAN  programs,  the  need  for  synchronization  arises  in  four  con¬ 
texts:  task  management,  loop  scheduling,  barrier  synchronization,  and  Advance/Await.  Task 
management  is  used  for  starting  the  execution  of  a  parallel  loop  on  multiple  processors.  In 
this  study,  task  management  is  implemented  by  a  global  task  queue.  The  processor  which 
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put_task()  { 

nevJ.oop->number_of_processors  *  0  ; 

new_loop->numbfer_of_iterations  *  number  of  iterations  in  loop; 

new_loop->barrier_counter  =  0  ; 

new_loop“>iteration_counter  -  0  ; 

lock  (task-queue)  ; 

insert_task_queue(new_loop)  ; 

task_queue_status  »  NOT-EMPTY  ; 

unlock(task-queue)  ; 

} 

Figure  3:  Producer  algorithm  for  loop  distribution 


executes  a  DOALL  or  DOACROSS  statement  places  the  loop  descriptor  in  the  global  task 
queue.  All  idle  processors  receive  this  loop  descriptor  and  start  the  execution  of  the  loop 
iterations.  The  accesses  to  the  task  queue  by  the  processors  are  mutually  exclusive.  In  our 
implementation,  we  use  a  task  queue  lock  to  enforce  mutual  exclusion.  Figures  3  and  4  show 
the  algorithms  for  the  processor  which  executes  the  parallel  DO  statement  and  for  the  idle 
processors  respectively.  The  removal  of  the  loop  descriptor  from  the  task  queue  is  addressed 
in  the  discussion  of  the  barrier  synchronization  algorithm. 

The  implementation  issues  for  the  functions  lockO,  unlockO,  and  increment  ()  with 
different  primitives  is  presented  in  the  next  section.  By  definition  lockO  and  unlockO 
operations  are  atomic.  Whenever  underlined  in  an  algorithm,  the  increment  ()  operation 
is  also  atomic  and  can  be  implemented  with  a  sequence  of  lock,  read-modify- write,  unlock 
operations. 

During  the  execution  of  a  loop,  each  processor  is  assigned  with  different  iterations.  This 
is  called  loop  scheduling.  We  used  the  processor  self-scheduling  algorithm  [18]  to  implement 
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read_task()  { 

while  (task_queue_status  =  EMPTY)  ; 
lock(task_queue)  ; 

current-loop  =  read_task_queue.head()  ; 

/*  Doesn't  remove  the  loop  from  the  queue  */ 
increment  (current _loop->number_of  .processors)  ; 
unlock(task-queue)  ; 

} 


Figure  4:  Consumer  algorithm  for  loop  distribution 


scheduleJiterationO  { 

last_iteration  =  increment  (current  JLoop->iteration_counter)  ; 
if  (last_iteration  >=  current_loop->number_of_iterations)  { 
barrier  synchronization  ; 

} 

else  { 

execute  (last_iteration  +  1  )th  iteration  of  loop; 

} 

} 

Figure  5:  Self  scheduling  algorithm  for  loop  iterations 


loop  scheduling.  In  processor  self-scheduling,  each  processor  executes  the  self-scheduling  code 
before  executing  a  parallel  loop  iteration.  The  self-sch editing  algorithm  shown  in  Figure  5  is 
executed  at  the  beginning  of  each  loop  iteration  and  it  uses  an  atomic  increment  operation 
on  a  shared  counter.  Unless  the  multiprocessor  supports  an  atomic  fetch&add  operation,  a 
lock  is  required  to  enforce  the  mutually  exclusive  accesses  to  the  shared  counter. 

After  ail  iterations  of  a  loop  are  executed,  processors  synchronize  at  a  barrier.  For  barrier 
synchronization,  we  :sed  a  non-blocking  linear  barrier  algorithm  which  is  implemented  with 


8 


barrier_synchronization()  { 

if  (currer+.J.oop->barrier_counter  ==»  0)  { 
lock(task_queue)  ; 

if  (current JLoop  ==  read_task_queue_head())  { 
delete_task_queueJiead()  ; 

if  (task_queue_empty()  ==  TRUE)  task_queue_status  =  EMPTY  ; 

} 

unlock(task.queue)  ; 

} 

if  (increment (current _loop->barrier_counter)  == 
current_loop->number_of_processors  -  i)  { 
resume  executing  program  from  the  end  of  this  loop  ; 

} 

else  read_task()  ; 


Figure  6:  Barrier  synchronization  algorithm 

a  shared  counter  (see  Figure  6).  After  all  iterations  of  a  parallel  loop  have  been  executed, 
each  processor  reads  and  increments  the  barrier  counter  associated  with  the  loop.  The 
last  processor  to  increment  the  counter  completes  the  execution  of  the  barrier.  As  in  the 
case  of  loop  self-scheduling,  unless  the  multiprocessor  system  supports  an  atomic  fetch&add 
operation,  the  mutually  exclusive  accesses  to  the  shared  counter  are  enforced  by  i  lock. 

In  this  algorithm,  the  first  processor  to  enter  the  barrier  removes  the  completed  loop 
from  the  task  queue.  Using  this  barrier  synchronization  algorithm,  the  processors  entering 
the  barrier  do  not  wait  for  the  barrier  exit  signal  and  can  start  executing  another  parallel 
loop  whose  descriptor  is  in  the  task  queue.  In  contrast  to  the  compile  time  scheduling  of 
fuzzy  barrier’  [15],  this  algorithm  allows  dynamic  schedulin'  of  loops  to  the  processors  in 
a  barrier. 
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In  its  presented  form,  the  last  processor  to  enter  the  barrier  executes  the  continuation  of 
the  parallel  loop  —  the  code  in  the  sequential  FORTRAN  program  that  is  executed  after  all 
iterations  of  the  current  loop  are  completed1. 

This  combination  of  task  scheduling,  iteration  self  scheduling  and  non-blocking  barrier 
synchronization  algorithms  allows  deadlock  free  execution  of  nested  parallel  loops  with  the 
restriction  that  DO  ACROSS  loops  appear  only  at  the  deepest  nesting  level  [18]. 

The  overhead  associated  with  task  management  and  barrier  synchronization  depends  on 
the  number  of  participating  processors.  When  an  P  processor  system  is  executing  a  parallel 
loop  with  N  iterations,  this  task  can  be  distributed  to  at  most  P  processors.  Therefore,  at 
most  P  processors  synchronize  at  a  given  barrier.  Using  processor  self-scheduling,  the  N 
iterations  are  distributed  to  processors  one  at  a  time. 

For  the  last  type  of  synchronization,  the  ADVANCE/AWAIT  statements  are  implemented 
by  a  vector  for  each  synchronization  point.  In  executing  a  DOACROSS  loop,  iteration  i, 
waiting  for  iteration  j  to  reach  synchronization  point  synch_pt,  busy  waits  on  location 
V[synch_pt]  [j].  Upon  reaching  point  synch_pt,  iteration  j  sets  location  VCsynch.pt]  [j]. 
This  implementation,  as  shown  in  Figure  7,  uses  regular  memory  read  and  write  operations, 
thus  does  not  require  atomic  synchronization  primitives.  This  implementation  assumes  a 
coherent  and  sequentially  consistent  memory  system.  In  the  presence  of  a  memory  system 
with  weak  ordering,  an  AWAIT  statement  can  be  executed  only  after  the  previous  memory 
write  operations  complete  execution.  For  a  multiprocessor  with  software  controlled  cahce 

*By  using  a  semaphore,  the  processor  which  executed  the  corresponding  DOALL/DOACROSS  statement 
can  be  made  to  wait  for  the  barrier  exit  to  execute  the  continuation  of  the  loop. 
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initialization(synch_pt)  '{ 

for  (i  *  i  ;  i  <  number.of  .iterations  ;  i++)  VCsynch_pt]  [i]  *  0  ; 

} 

advance  (synch_pt)  { 

V[synch_pt]  [iteration .number]  *  1  ; 

} 

await  (synch.pt,  dependence-distance)  { 

if  (iteration-number  <=  dependence_distance)  returnO  ; 

else  while  (V [synch.pt]  [iteration .number  -  dependendence_di stance]  ==  0)  ; 

} 


Figure  7:  Algorithm  for  ADVANCE/AWAIT  operations 


coherency  protocols,  Cedar  FORTRAN  ADVANCE/AWAIT  statements  include  the  list  of 
variables  whose  values  should  be  read  from/written  to  shared  memory  before  their  execu¬ 
tion.  The  implementation  details  of  these  statements  in  multiprocessors  with  weakly  ordered 
memory  models  or  software  controlled  cache  coherency  protocols  are  beyond  the  scope  of 
this  paper. 

In  a  multiprocessor  which  does  not  support  an  atomic  fetch&add  operation,  lock  accesses 
play  an  important  role  in  the  execution  of  scheduling  and  barrier  synchronization  algorithms 
The  next  section  discusses  the  implementation  issues  of  lock  accesses  in  the  presence  of 
different  synchronization  primitives. 


2.3  Locks  and  Hardware  Synchronization  Primitives 

In  executing  numeric  parallel  programs,  locks  are  frequently  used  in  synchronization  and 
scheduling  operations.  In  the  task  schedi  ling  algorithm  (See  Figures  3  and  4),  the  use  of 
locks  enforces  mutually  exclusive  access  of  processors  to  the  task  queue.  Locks  are  also  used 
to  ensure  correct  modification  of  shared  counters  when  an  atomic  fetch&add  operation  is  not 
supported  by  the  architecture.  Such  shared  counters  are  used  both  by  loop  scheduling  (See 
Figure  5)  and  barrier  synchronization  (See  Figure  6)  algorithms. 

There  are  several  algorithms  for  implementing  lock  accesses  in  cache  coherent  multipro¬ 
cessors  using  hardware  synchronization  primitives[3][14].  These  algortihms  have  different 
dynamic  characteristics.  Consider  the  execution  of  a  linear  barrier,  which  is  implemented  by 
a  shared  barrier  counter.  In  the  case  where  all  processors  arrive  at  the  barrier  at  the  same 
time,  a  simple  spin  lock  algorithm  to  enforce  exclusive  access  to  the  counter  will  cause  ex¬ 
cessive  bus  traffic.  This  would  slow  down  the  execution  of  the  barrier.  However,  in  the  best 
case,  processors  would  arrive  at  the  barrier  at  different  times,  causing  no  lock  contention.  In 
this  case,  the  latency  of  a  lock  operation  is  important  for  the  overhead  in  entering  a  barrier. 
An  important  tradeoff  in  lock  algorithms  is  their  performance  under  heavy  load  versus  the 
latency  of  lock  operations.  In  Section  4  we  analyze  the  implications  of  using  different  lock 
algorithms  on  the  performance  of  loop  scheduling. 

All  existing  multiprocessor  architectures  provide  some  hardware  support  for  atomic  syn¬ 
chronization  operations.  Functionally,  any  synchronization  primitive  can  be  used  to  satisfy 
the  high  level  synchronization  needs  of  a  parallel  program.  In  practice,  different  primitives 
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may  result  in  very  different  performance  levels.  For  example,  a  queuing  lock  algorithm  [3]  [14] 
can  be  implemented  efficiently  with  an  exchange-byte  or  a  fetch&add  primitive  but  a  test&set 
implementation  is  more  complicated  and  may  be  inefficient. 

The  exchange-byte  version  of  the  queuing  lock  algorithm  is  shown  in  Figure  8.  In  this 
implementation,  the  exchange-byte  primitive  is  used  to  construct  a  logical  queue  of  processors 
which  contend  for  a  lock.  The  variable  my.id  is  assumed  to  be  set  at  the  start  of  the  program 
such  that  it  is  value  for  the  ith  processor  is  i.  The  variable  queue_tail  holds  the  I.D.  of 
the  last  processor  which  tried  to  acquire  this  lock.  A  processor  which  tries  to  access  the 
lock  receives  the  I.D.  of  the  processor  which  preceded  it  and  writes  its  own  I.D.  into  the 
variable  queue.tail.  This  algorithm  constructs  a  queue  of  processors  waiting  for  a  lock 
where  each  processor  waits  only  on  its  predecessor  for  the  release  of  the  lock.  By  mapping 
the  elements  of  synchronization  vector  flags  []  to  non-conflicting  cache  lines,  the  memory 
accesses  in  the  while  loop  of  this  algorithm  can  be  confined  to  individual  caches  of  processors. 
When  a  processor  releases  the  lock,  only  the  cache  line  read  by  its  successor  in  the  queue  is 
invalidated. 

In  implementing  the  queuing  lock  algorithm  with  the  test&set  primitive,  because  of  the 
functional  limitations  of  this  primitive,  the  queue  of  processors  contending  for  a  lock  can 
not  be  constructed  with  a  single  atomic  operation.  This  introduces  a  critical  section  into 
the  implementation  of  queuing  operation,  which  requires  the  use  of  an  auxiliary  lock.  When 
test&set  is  used  to  emulate  the  exchange-byte  primitive  in  the  algorithm  in  Figure  8,  the 
queuing  operation  becomes  the  sequence  of  operations 
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initializationO  { 
flags [0]  =  FREE  ; 
flags[l...P]  -  BUSY  ; 
queue-tail  =  0  ; 

} 

lock()  { 

queue-last  =  exchange-byte (my  _id,  queue-tail)  ; 
while  (flags  [queue-last]  =-  BUSY)  ; 
flags  [queue-last]  =  BUSY  ; 

} 

unlock ()  { 

flags  [my_id]  =  FREE  ; 

} 

Figure  8:  Queuing  lock  algorithm  for  lock  accesses 


lock(auxilaryJock)  ;  read(queue.tail)  ;  write(myJd)  ;  unlock(auxilaryJoclc). 

Clearly,  the  lock  operations  used  in  the  implementation  of  the  queuing  lock  algorithm  need 
to  be  implemented  with  a  different  algorithm  like  test&test&set  (see  Figure  9). 

In  the  synchronization  algorithms  presented  in  Section  2.2,  most  of  the  lock  operations 
are  used  to  implement  atomic  increment  operations.  The  critical  section  involved  in  an 
atomic  increment  operation  consists  of  one  memory-read,  one  addition,  and  one  memory- 
write  instruction,  which  is  similar  to  the  critical  section  used  in  emulating  an  exchange-byte 
operation  (one  memory-read  and  one  memory-write  instruction).  Therefore,  the  overhead 
of  constructing  the  lock  queue  in  the  test&set  implementation  of  a  queuing  lock  would  be 
similar  to  the  overhead  in  using  the  test&test&set  algorithm  to  implement  lock  operations  for 
accessing  shared  counters.  Therefore,  whenever  the  architecture  supports  only  the  test&set 
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lockO  { 

while(lock  ==  BUSY  I  I  testftset(lock)  ==  BUSY)  ; 

} 

unlockO  { 

lock  =  CLEAR  ; 

} 

Figure  9:  Test&test&set  algorithm  for  lock  accesses 

primitive,  a  plain  test&test&set  algorithm  is  used  in  this  study  to  implement  all  lock  opera¬ 
tions  2.  For  an  architecture  with  the  exchange-byte  primitive,  the  queuing  lock  algorithm  is 
used  for  lock  operations. 

Due  to  the  emphasis  on  atomic  increment  operations,  supporting  a  fetch&add  operation 
in  hardware  can  significantly  decrease  the  need  for  lock  accesses  in  the  synchronization 
algorithms.  When  fetch&add  is  the  main  synchronization  primitive  of  a  system,  we  used  a 
fetch&add  implementation  of  test&test&set  algorithm  to  support  the  lock/unlock  operations 
in  task  management.  The  performance  implications  of  supporting  a  fetch&add  primitive  on 
loop  scheduling  algorithm  will  be  presented  in  Section  4. 

The  functionality  and  sophistication  of  hardware  synchronization  support  increase  the 
cost  of  a  system.  In  the  Alliant  FX/8,  a  separate  synchronization  bus  and  a  Concurrency 
Control  Unit  is  provided  [2]  and  which  can  improve  parallel  program  performance  by  reducing 
the  and  latency  of  lock  accesses  and  the  memory  contention  caused  by  them.  Therefore  in 

“However,  We  would  like  to  point  out  that  in  an  environment  where  critical  sections  of  algorithms  in¬ 
volve  many  instructions  and  memory  accesses,  a  tesi&sei  implementation  of  a  queuing  lock  may  enhance 
performance. 
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our  analysis  in  Section  4,  we  also  consider  the  case  where  a  synchronization  bus  is  used 
to  implement  lock  operations.  Finally,  the  cost  performance  tradeoffs  for  syncb**onization 
support  can  only  be  decided  by  evaluating  the  performance  implications  of  different  schemes 
for  real  parallel  applications.  These  experiments  are  presented  in  Section  6. 

3  Experimental  Method 

To  evaluate  the  performance  of  several  parallel  processor  architectures,  we  used  a  high-level 
trace  driven  simulator.  In  our  approach,  we  used  an  abstract  model  of  parallel  program  exe¬ 
cution  which  presents  a  simplified  view  of  the  application  program  while  allowing  a  detailed 
evaluation  of  synchronization  support. 

Execution  of  a  sequential  program  on  one  of  the  processors  of  a  shared  memory  MIMD 
machine  can  be  modeled  by  partitioning  the  program  execution  time  into  two  sections:  the 
execution  that  is  local  to  the  processor  (execution  of  instructions  in  the  CPU  and  memory 
accesses  to  the  local  cache)  and  the  time  spent  in  handling  memory  requests  to  the  shared 
memory.  For  computationally  intensive  numeric  applications  where  I/O  and  system  calls  ac¬ 
count  for  a  small  fraction  of  execution  time,  this  simplistic  model  can  be  used  to  approximate 
sequential  program  execution  time. 

Using  a  RISC  based  processor  model  where  instruction  execution  times  are  defined  by 
the  architecture,  the  local  execution  time  of  a  program  can  be  calculated  directly  from  its 
dynamic  instruction  count.  On  the  other  hand,  the  time  to  service  the  accesses  to  shared 
memory  depends  not  only  on  the  data  and  instruction  access  patterns  of  the  local  processor 


16 


but  also  on  the  activities  of  other  processors  in  the  system. 

Extending  this  model  to  parallel  processing,  a  parallel  FORTRAN  program  can  be  mod¬ 
eled  as  a  collection  of  program  segments,  that  we  will  call  task-pieces,  where  each  task-piece 
is  a  sequential  part  of  the  application  which  executes  on  a  single  processor.  The  depen¬ 
dences  among  task-pieces  are  enforced  by  events.  The  two  special  events  program-start  and 
program-end  mark  the  beginning  and  the  completion  of  a  program.  With  this  model,  the 
execution  of  a  sequential  program  consists  of  a  program-start  event  followed  by  a  task-piece 
which  covers  the  whole  of  the  program  execution,  and  a  program.end  event.  For  parallel 
FORTRAN  programs,  the  additional  events  are:  execution  of  DOALL  and  DOACROSS 
statements,  beginning  and  end  of  parallel  loop  iterations,  barriers,  and  execution  of  Ad¬ 
vance/Await  statements. 

A  high  level  trace  is  the  record  of  events  that  took  place  during  execution  and  the  in¬ 
formation  about  task-pieces  executed  between  pairs  of  events.  In  this  study,  the  high  level 
traces  are  collected  by  manual  source  code  instrumentation  of  parallelized  applications.  In 
the  trace,  each  event  is  identified  by  its  type  (DOALL,  iteration  start,  barrier  etc.)  and 
applicable  arguments  (e.g.,  the  synchronization  point  and  the  iteration  number  for  an  Await 
event).  The  task  pieces  are  represented  by  the  dynamic  count  of  read  and  write  accesses  to 
shared  data  and  the  approximate  number  of  dynamic  instructions  executed. 

An  event  driven,  bus  based  shared  memory  multiprocessor  simulator  is  used  to  calculate 
the  program  execution  time  from  a  high  level  trace.  The  simulator  implements  the  task 
management  and  synchronization  algorithms  for  different  synchronization  piimitives  by  using 
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Table  1:  Relative  timing  parameters  for  simulations 


component/operation 

cycle  time 

processor  (base) 

1 

memory  bus 

1 

memory  module 

3 

test&set 

6 

exchange-byte 

6 

fetch&add 

6 

lock/unlock  (synchronization  bus) 

1 

fetch&add  (synchronization  bus) 

1 

the  algorithms  described  in  Section2.  In  our  experiments,  the  atomic  operations  test&set, 
exchange-byte  and  fetch&add  are  implemented  in  shared  memory.  The  processor  memory 
interconnection  is  a  decoupled  access  bus  whose  cycle  time  is  equal  to  the  processor  cycle 
time.  We  assumed  that  shared  memory  is  8-way  interleaved  where  an  access  to  a  module 
takes  3  bus  cycles.  Atomic  operations  that  are  implemented  in  memory  take  two  memory 
cycles  (e.g.  a  test&set  operation  takes  6  bus  cycles  to  execute  in  memory).  When  support  of 
a  synchronization  bus  is  evaluated,  a  single  cycle  access  synchronization  bus  model  is  used. 
A  summary  of  the  timings  parameters  is  shown  in  Table  1. 

In  the  simulation  model,  an  invalidation  based  write-back  cache  coherence  scheme  is  used. 
The  atomic  operations  implemented  in  shared  memory  are  assumed  to  be  write-through  and 
result  in  a  single  word  bus  transaction.  This  allows  caching  of  synchronization  variables  (a 
necessity  for  efficiency  of  spin-locks)  with  reduced  bus  traffic  for  atomic  operations.  The 
event  in  a  parallel  program  are  simulated  at  the  level  of  individual  bus  transactions,  taking 
into  account  the  contention  at  the  bus  and  memory  module  access  conflicts. 
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Table  2:  Assumptions  for  memory  traffic 


parameter 

value 

memory/instruction  ratio 

shared  data  cache  miss  rate 

non-shared  data  cache  miss  rate 

EH! 

The  other  component  of  shared  memory  traffic  is  the  data  needs  of  task-pieces.  The 
shared  memory  traffic  contributed  by  the  application  is  modelled  based  on  the  measured 
instruction  count  and  frequency  of  shared  data  accesses.  Table  2  lists  the  assumptions  used 
to  ‘simulate  the  memory  traffic  for  the  task-pieces.  We  assume  that  20%  of  the  instructions 
executed  are  memory  references.  In  addition,  we  measured  that  6-8%  of  all  instructions 
(approximately  35%  of  all  memory  references)  are  to  shared  data.  We  assume  that  references 
to  shared  data  cause  the  majority  of  cache  misses  (80%  shared  data  cache  miss  rate  and  5% 
noil-shared  data  cache  miss  rate)3. 

The  bus  transactions  due  to  the  cache  misses  of  the  task  pieces  are  combined  with  those 
contributed  by  the  servicing  of  events  to  simulate  the  overall  system  behavior.  The  simulation 
is  performed  on  a  cycle-by-cycle  basis. 


4  Analysis  of  loop  scheduling  overhead 


In  the  execution  of  a  parallel  loop,  the  efFect  of  loop  scheduling  overhead  on  performance 
depends  on  the  number  of  processors,  total  number  of  iterations,  and  the  size  of  an  iteration. 


3When  we  repeated  the  experiments  by  lowering  the  shared  cache  miss  rate  to  40%,  the  speedup  figures 
reported  in  Section  5  changed  by  less  than  2%. 
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In  this  section  we  will  first  derive  the  expressions  for  speedup  in  executing  parallel  loops  where 
the  loop  iterations  are  large  (coarse  granularity)  and  where  the  loops  iterations  are  small 
(fine  granularity).  These  expressions  pro\ide  an  insight  to  how  loop  scheduling  overhead 
influences  loop  execution  time,  and  will  be  used  in  analysis  of  simulation  results  later  in  this 
section. 

Consider  a  DOALL  loop  with  N  iterations  where  each  iteration,  without  any  parallel 
processing  overhead,  takes  f;  time  to  execute.  For  a  given  synchronization  primitive  and 
lock  algorithm,  let  tach  be  the  time  it  takes  for  a  processor  to  schedule  an  iteration.  We  will 
look  at  the  impact  of  scheduling  overhead  for  two  cases.  For  the  first  ca*e  v?  assume  that 
when  a  processor  is  scheduling  an  iteration,  it  is  the  only  processor  doing  so.  When  the 
accesses  to  the  shared  counter  in  loop  scheduling  algorithm  are  implemented  with  a  lock, 
t3ch  can  be  written  as 

tsch  ~  ^lock  "f  ^update  *1*  t unlock 

where  tiock  and  tuniock  are  the  time  it  takes  to  acquire  and  release  a  lock  respectively,  and 
tupdate  is  the  time  for  reading  and  incrementing  the  shared  counter.  The  execution  of  several 
iterations  of  a  loop  for  this  case  is  shown  in  Figure  10.  In  the  figure,  when  P\  completes  the 
execution  of  the  first  iteration,  it  schedules  the  next  iteration  without  delay. 

For  any  given  P  and  t3Ch >  the  necessary  condition  for  this  case  is 

ti  >  (P  -  1)  X  tsch, 

and  the  time  to  execute  the  loop  with  P  processors  can  be  written  as 
tr  =  ({tSch  +  ti)x\N/P])+toh, 


20 


where  t0h  is  the  total  task  scheduling  and  barrier  synchronization  overhead  per  processor. 
Since  the  task  scheduling  and  b^rriei  synchronization  overhead  depends  only  on  the  number 
of  processors,  t0h,  is  constant  for  a  given  P. 

The  execution  time  of  the  sequential  version  of  this  loop,  t3eq,  is  t\  x  N  which  is  not 
equal  to  t\  —  single  processor  execution  time  of  the  parallel  loop.  We  define  speedup  for  P 
processors  as  the  ratio  of  tseq  to  tp.  The  speedup  for  a  POALL  loop  is 


speedup  =  —%■ 
tP 

_ UN _ 

{(iach  +  tl)  X  \N/P])+toh 
P 

tl  ‘  Nxti 


for  N  ^  P 


speedup 


P  x 


tsch  "t"  t{ 


using  t[>  (P-  1)  x  t3ch 

speedup  >  P  x  t  - - 

pf T  + 


>  P-1 

Therefore,  when  U  >  (P  - 1)  x  t3Ch,  the  speedup  is  linear  with  number  of  processors  hence 
the  execution  time  depends  only  on  P  and  the  total  amount  of  work  in  the  loop,  N  x  tp 
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Now  let  us  consider  the  case  where  a  processor  completing  the  execution  of  an  iteration 

always  has  to  wait  to  schedule  the  next  iteration  because  of  another  processor  scheduling  an 

* 

iteration  at  that  time.  We  will  call  the  scheduling  overhead  for  this  case  t'3ch.  This  scenario 
is  illustrated  in  Figure  11.  In  this  figure  after  the  completion  of  the  first  iteration,  Pi  waits 
completion  of  the  scheduling  operations  by  other  processors  before  scheduling  an  iteration. 
In  general,  t'3ch  is  different  from  tach  because  of  the  contention  caused  by  several  processors 
trying  to  schedule  an  iteration.  The  necessary  condition  for  this  case  is 

!,  <  (P  -  1)  X  t'Kh, 

and  the  loop  scheduling  overhead  forms  the  critical  path  in  determining  the  loop  execution 
time.  When  loop  scheduling  becomes  the  bottleneck,  execution  time  is: 

tp  =  N  x  t'3ch  4-  U , 

for  N  >  P 


tp^N  x  t'sch. 


When  the  loop  scheduling  algorithm  is  implemented  with  lock  operations,  scheduling  an 
iteration  involves  transferring  the  ownership  of  the  lock  from  one  processor  to  the  next,  and 
reading  and  incrementing  the  shared  counter.  Therefore 

^ sch  —  Uock  —  transfer  4"  t update • 

In  the  remainder  of  this  section  we  will  first  look  at  how  loop  execution  time  varies 
with  loop  granularity.  Since  t'sch  directly  influences  the  execution  time  of  a  loop,  we  will  then 
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measure  this  loop  scheduling  overhead  for  different  hardware  synchronization  primitives  using 
our  simulation  tool. 

4.1  Granularity  effects 

The  analysis  above  shows  the  existence  of  two  different  types  of  behavior  of  execution  time 
for  a  parallel  loop.  Given  a  multiprocessor  system,  the  parameters  P,  tsch  and  t'sch  do  not 
change  from  one  loop  to  another.  Keeping  these  parameters  constant,  the  granularity  of  a 
loop,  ti,  determines  whether  scheduling  overhead  will  be  significant  in  overall  execution  time 
or  not. 

The  ai  Jiitectural  support  for  synchronization  primitives  influences  the  execution  time  of 
a  parallel  ioop  in  two  ways.  On  one  hand,  different  values  of  t'3ch  for  different  primitives  result 
in  different  execution  time  when  the  loop  iterations  are  small  (i.e.,  fine  granularity  loops). 
On  the  other  hand  tsch  determines  whether  a  loop  is  of  fine  or  coarse  granularity.  In  this 
section  we  present  the  simulation  results  on  how  loop  execution  time  varies  across  different 
implementations  of  the  loop  scheduling  algorithm.  Since  t'ach  determines  the  execution  time 
of  a  fine  granularity  loops,  we  quantify  how  t'sch  changes  with  synchronization  primitives 
used,  and  the  number  of  processors  in  the  system.  In  our  simulations,  the  memory  access 
characteristics  of  the  loops  were  modelled  as  presented  in  Table  2. 

Figures  12-15  show  the  simulation  results  for  execution  time  vs.  the  size  of  an  iteration  in 
a  DO  ALL  ioop.  The  loop  sizes  are  in  terms  of  the  number  of  instructions,  and  the  execution 
time  in  terms  of  CPU  cycles.  In  these  simulations,  the  total  amount  of  instructions  in  the 
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Figure  12:  Execution  time  vs.  granularity  for  test&set  primitive 

loop  is  kept  constant  while  changing  the  number  of  instructions  in  an  iteration.  It  can  be 
seen  in  these  figures  that  when  loop  iterations  are  very  large,  the  execution  time  of  a  loop 
on  a  given  number  of  processors  is  that  same  for  different  synchronization  primitives.  There 
is  also  a  monotonic  increase  in  execution  time  as  loop  size  gets  smaller  in  all  cases. 

Figure  12  shows  that  for  16  processors  and  using  test&set  primitive,  there  is  a  sharp 
increase  in  execution  time  when  iteration  size  is  less  than  550  instructions.  This  number 
is  around  300  for  exchange-byte,  and  200  for  a  synchronization  bus,  see  Figures  13  and  14. 
As  shown  in  Figure  15,  using  the  fetch&add  primitive,  the  iteration  size  where  execution 
time  starts  increasing  is  around  100  instructions.  We  observed  that  in  the  FORTRAN 
programs  we  used  in  our  experiments  the  iteration  sizes  of  the  parallel  loops  vary  from  20  to 
1000  instructions.  This  shows  that  the  choice  of  a  synchronization  primitive  will  influence 
the  perrformance  of  some  loops.  The  distribution  of  instructions  in  the  dynamic  execution 
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Figure  13:  Execution  time  vs.  granularity  for  exchange-byte  primitive 


Figure  14:  Execution  time  vs.  granularity  for  synchronization  bus 
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Figure  15:  Execution  time  vs.  granularity  for  fetch&add  primitive 


traces  with  respect  to  loop  granularity  for  the  application  programs  is  presented  in  Section  5. 


4.2  Scheduling  overhead  for  fine  grain  loops 

For  fine  grain  loops,  the  loop  execution  time  Tp  is  approximated  by  N  x  t'sch.  The  change  of 
execution  time  with  respect  to  the  number  of  iterations  of  a  loop  is  shown  in  Figures  16-19. 
The  synthetic  loops  used  in  these  simulations  has  a  total  of  220000  instructions.  Therefore, 
the  region  where  iteration  size  <  50  instructions  corresponds  to  N  >  4400  in  these  figures. 
The  common  observation  from  these  figures  is  that  when  loop  iterations  are  sufficiently 
small  (V  is  sufficiently  large),  the  execution  time  increases  linearly  with  N.  Also,  when 
extrapolated,  Tp  vs.  N  lines  go  through  the  origin  which  validates  the  linear  model 

TP  =  N  x  t'sch 

for  execution  time. 


28 


Figure  16:  Execution  time  vs.  number  of  iterations  for  test&set  primitive 


Figure  17:  Execution  time  vs.  number  of  iterations  for  exchange-byte  primitive 
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Figure  18:  Execution  time  vs.  number  of  iterations  for  synchronization  bus 


Figure  19:  Execution  time  vs.  number  of  iterations  for  fetch&add  primitive 


Figure  20  shows  how  scheduling  overhead  per  iteration,  t'sch ,  changes  for  the  different 
synchronization  primitives  as  the  number  of  processors  increases. 

Using  the  test&set  primitive,  the  scheduling  overhead  increases  with  number  of  pro¬ 
cessors.  For  the  exchange-byte  and  fetch&add  primitives  and  the  synchronization  bus,  the 
scheduling  overhead  scales  well.  Furthermore  t'ach  shows  great  variance  across  primitives.  For 
the  16  processor  case  the  average  number  of  cycles  to  schedule  a  loop  iteration  are  98,  31, 
17  and  7  cycles  for  test&set,  exchange-byte,  synchronization  bus,  and  fetch&add  primitives 
respectively. 

The  synchronization  bus  model  used  in  these  simulations  has  single  cycle  access  time 
for  free  locks  and  single  cycle  lock  transfer  time.  Therefore  the  synchronization  bus  data 
shows  the  highest  performance  achievable  by  hardware  support  for  lock  accesses  alone.  In 
Section  6,  the  performance  figures  for  a  synchionization  bus  which  also  supports  single  cycle 
fetch&add  operation  are  given.  Such  a  synchronization  bus  is  capable  of  scheduling  a  loop 
iteration  every  clock  cycle.  Therefore  its  overall  performance  can  be  expected  to  be  better 
than  all  the  primitives  analysed  in  this  section. 

5  Synchronization  Characteristics  of  Applications 

5.1  Loop  granularity  of  application  programs 

The  two  applications  we  used  in  this  study  are  BDNA  and  FL052.  BDNA  is  a  molecular  dy¬ 
namics  simulator  for  biomolecules  in  water  and  it  uses  ordinary  differential  equation  solvers. 
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Figure  20:  Loop  scheduling  overhead  vs.  number  of  processors 

FL052  is  a  fluid  dynamics  program  which  uses  multigrid  schemes  and  ordinary  differential 
equation  solvers.  Both  programs  are  vectorizable. 

These  programs  have  different  parallelism  structures  and  loop  granularity.  In  the  BDNA 
program,  most  of  the  parallel  loops  are  not  nested  and  the  iterations  are  200-1000  instructions 
long.  Two  thirds  of  all  parallel  loops  in  the  trace  are  DOACROSS  loops.  In  the  FL052 
program,  most  of  the  parallelism  exists  in  the  form  of  nested  DOALL  loops.  The  size  of  the 
innermost  loop  iterations  varies  between  20-250  instructions. 

The  cumulative  distribution  of  instructions  executed  with  respect  to  the  iteration  size 
of  parallel  loops  is  shown  in  Figure  21  for  both  programs.  Because  the  analysis  of  loop 
scheduling  showed  that  execution  of  loops  with  iterations  larger  than  500  instructions  do 
not  suffer  from  scheduling  overhead,  only  the  loops  with  iterations  less  than  500  instructions 
are  of  concern.  The  BDNA  curve  in  Figure  21  shows  that  for  this  program  only  17%  of  all 
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Figure  21:  Cumulative  distribution  of  dynamic  instructions  for  loop  iteration  size 

instructions  were  executed  in  loops  with  iteration  size  less  than  500  instructions.  On  the 
other  hand,  more  than  half  of  the  total  computation  in  FL052  program  is  done  in  loops 
where  iterations  have  less  than  100  instructions.  From  the  analysis  and  simulation  results  in 
the  previous  section,  we  can  expect  the  performance  of  FL052  program  be  limited  by  loop 
scheduling  overhead. 

5.2  Locality  of  lock  accesses  in  synchronization  algorithms 

In  our  simulations,  we  observed  that  both  programs  exhibit  very  low  locality  for  lock  accesses. 
When  a  processor  acquires  a  lock,  we  consider  it  a  lock  hit  if  the  processor  which  released 
the  lock  last  is  the  same  processor.  Otherwise,  acquiring  a  lock  is  said  to  result  in  a  lock 
miss.  The  measured  lock  hit  rate  for  the  two  programs  with  4  or  more  processors  was  less 
than  0.2%.  Such  a  low  value  of  lock  locality  can  be  explained  by  the  dynamic  behavior  of 
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scheduling  and  synchronization  algorithms. 

For  each  parallel  loop,  every  processor  acquires  the  task  queue  lock  and  barrier  lock  only 
once.  This  results  in  a  round-robin  style  accesses  to  these  locks.  For  the  same  parallel  loop, 
the  loop  counter  lock  used  in  the  loop  self-scheduling  algorithm  is  accessed  multiple  times 
by  each  processor.  However,  a  lock  hit  can  occur  only  when,  the  last  processor  which  got 
an  iteration  number  finishes  execution  of  that  iteration  before  the  executions  of  previously 
scheduled  iterations  complete.  Due  to  the  very  low  variance  of  iteration  size  among  the 
iterations  of  a  parallel  loop  in  these  programs,  this  scenario  is  unlikely. 

In  the  experiments,  because  of  the  low  lock  hit  rate,  the  atomic  memory  operations 
are  implemented  in  shared  memory.  An  implementation  of  atomic  operations  in  caches  or 
processors  would  result  in  excessive  invalidation  traffic,  and  would  also  increase  the  latency 
of  atomic  operations. 

6  Experimental  Results 

In  this  section  we  present  the  experimental  results  for  two  programs  from  the  Perfect  Club 
benchmark  set:  BDNA  and  FL052.  Speedup  of  parallel  programs  with  respect  to  the  ex¬ 
ecution  time  of  the  sequential  version  of  the  programs  is  used  as  the  performance  metric. 
There  is  no  parallel  processing  overhead  in  the  sequential  version.  In  Section  6.1  the  is¬ 
sues  of  lock  contention  and  lock  access  latency  are  discussed.  For  this  analysis  we  use  the 
test&set  atomic  operation  to  implement  the  test&test&set  algorithm  and  the  exchange-byte 
atomic  operation  to  implement  the  queuing  lock  algorithm.  In  Section  6.2,  we  piesent  the 
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performance  implications  of  more  sophisticated  synchronization  support. 

6.1  Lock  contention  and  latency 

The  first  issue  we  focus  on  is  lock  contention.  In  the  first  set  of  simulations,  we  used  the 
test&test&set  algorithm  to  implement  lock  accesses.  The  speedup  obtained  from  the  two 
programs  are  shown  in  Figure  22  and  Figure  23.  In  the  BDNA  program  (Figure  22),  using 
the  test&test&set  algorithm,  the  peak  speedup  of  5.75  is  reached  with  12  processors.  In 
FLOe^  (Figure  23),  a  speedup  of  1.14  is  obtained  with  4  processors.  Increasing  the  number 
of  processors  beyond  4  makes  the  performance  worse. 

The  speedup  for  the  single  processor  case  is  0.86  for  BDNA  and  0.54  for  FL052.  This 
data  shows  that,  running  on  a  single  processor,  FL052  spends  almost  half  of  its  execution 
time  in  scheduling  and  synchronization  algorithms. 
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Figure  23:  Lock  latency  and  contention  effects  on  FL052  program 

In  the  second  set  of  experiments,  we  used  a  queuing  lock  algorithm  for  lock  accesses  to 
observe  the  effect  of  decreasing  lock  contention.  As  shown  in  Figures  22  and  23,  the  queuing 
lock  significantly  increases  the  performance  of  both  programs  when  the  number  of  processors 
is  large.  With  16  processors,  the  speedup  improves  by  50%  in  BDNA  and  200%  in  FL052. 
These  results  show  that  controlling  lock  contention  with  algorithms  such  as  queuing  lock 
does  increase  program  performance. 

The  next  issue  we  looked  at  was  the  importance  of  lock  access  latency.  The  effect  of 
doubling  the  lock  access  latency  in  a  queuing  lock  is  shown  in  Figures  22  and  23  (slow 
queuing  lock).  For  the  16  processor  case,  doubling  the  lock  access  latency  decreases  the 
speed  up  by  10%  for  BDNA  and  by  50%  for  FL052. 
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Figure  24:  Performance  of  BDNA  program  with  the  use  of  synchronization  bus  and 
fetch&add  primitive 

6.2  Efficient  architectural  support  for  synchronization 

In  the  previous  section,  we  pointed  out  the  importance  of  efficient  lock  operations.  The  next 
issue  is  the  effect  of  using  a  synchronization  bus  for  lock  operations  on  program  performance. 
A  synchronization  bus  allows  execution  of  synchronization  operations  with  minimal  latency 
and  isolates  synchronization  traffic  from  memory  traffic. 

As  shown  in  Figure  22,  the  speedup  obtained  from  BDNA  using  a  synchronization  bus  is 
8.25  for  the  16  processor  case.  The  improvement  in  speedup  over  the  queuing  lock  case  (see 
Figure  22)  ranges  from  2%  to  7%  when  the  number  of  processors  changes  from  8  to  16.  For 
the  FL052  program,  from  Figure  23  it  can  be  seen  that  using  a  synchronization  bus  has  a 
dramatic  effect  on  program  execution  time.  The  speedup  obtained  for  8  or  more  processors 
is  2.5  which  is  60%  higher  than  the  speedup  in  the  case  of  a  queuing  lock. 
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Figure  25:  Performance  of  FL052  program  with  the  use  of  synchronization  bus  and 
fetch&add  primitive 

As  discussed  in  Section  2.3,  a  significant  number  of  the  lock  accesses  required  to  increment 
the  shared  counters  can  be  eliminated  with  the  support  of  a  fetch&add  primitive  in  hardware. 
In  the  next  experiment,  the  fetch&add  operation  (implemented  in  shared  memory)  was 
used  for  incrementing  shared  counters  in  loop  self-scheduling  and  barrier  synchronization 
algorithms.  The  results  of  these  simulations  are  shown  in  Figures  24  and  25  for  programs 
BDNA  and  FL052  respectively. 

In  the  BDNA  program,  the  speedup  obtained  by  using  a  fetch&add  primitive  implemented 
in  hardware  is  the  same  as  the  speedup  obtained  by  using  a  synchronization  bus  for  lock 
accesses.  In  FL052,  a  similar  behavior  is  observed.  Therefore,  the  performance  benefits 
of  a  dedicated  synchronization  bus  for  lock  accesses  can  be  achieved  at  a  lower  cost  by 
implementing  an  atomic  fetch&operation  in  shared  memory. 

Finally,  we  considered  the  case  where  a  synchronization  bus  is  used  to  implement  both 
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lock  accesses  and  fetch&add  operation.  As  shown  in  Figure  24,  this  resulted  in  a  marginal 
increase  in  performance  for  the  BDNA  program.  However,  for  the  synchronization  bound 
program  FL052,  Figure  25  shows  that  the  performance  increase  is  in  excess  of  100%,  reaching 
5.3  for  16  processors. 

7  Related  Work 

There  has  been  considerable  attention  paid  to  the  synchronization  problem  for  multiproces¬ 
sors.  Brooks  proposed  the  Butterfly  Barrier  [16]  which  does  not  have  the  hot  spots  observed 
in  linear  barriers.  Gupta’s  “Fuzzy  Barrier”  improves  processor  utilization  by  allowing  proces¬ 
sors  to  do  useful  work  in  a  barrier  as  a  result  of  compile  time  analysis.  The  barrier  algorithm 
we  presented  overlaps  barrier  execution  with  useful  work  by  exploiting  parallelism  in  nested 
parallel  loops.  An  analytical  analysis  of  different  barrier  synchronization  algorithms  'were 
made  by  Arenstrof  and  Jordan  [4].  Beckmann  and  Polychronopoulos  studied  the  effect  of 
barrier  synchronization  and  scheduling  overhead  and  presented  a  similar  analytical  formula¬ 
tion  for  execution  time  characterization  for  loop  scheduling  bound  execution  [5].  They  used 
synthetic  parallel  loops  as  workload  in  their  experiments.  Polychronopoulos  also  developed 
guided  self-scheduling  scheme  for  loop  scheduling  [17].  In  the  future,  we  plan  to  evaluate  per¬ 
formance  implications  of  alternative  loop  scheduling  and  barrier  synchronization  algorithms 
for  parallel  scientific  applications. 

Finally,  another  experimental  study  on  synchronization  in  real  parallel  applications  was 
done  previously  by  Davis  and  Iiennesey  [10].  Their  work  concentiated  on  how  program 
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characteristics  change  synchronization  behavior.  For  their  class  of  applications,  they  con¬ 
cluded  that  implementation  of  synchronization  operations  have  little  effect  on  program  per¬ 
formance. 


8  Concluding  Remarks 

In  this  paper  we  demonstrated  the  feasibility  of  concentrating  on  one  aspect  of  parallel  pro¬ 
gram  execution  within  the  perspective  of  the  overall  program  performance.  We  analyzed 
the  performance  implications  of  synchronization  support  for  FORTRAN  programs  paral¬ 
lelized  by  a  state-of-the-art  compiler.  In  these  programs,  parallelism  was  exploited  at  loop 
leve’  where  the  granularity  of  loops  showed  large  variance  across  applications.  We  addressed 
the  task  management  and  synchronization  issues  that  arise  in  executing  these  programs  at 
different  levels  of  abstraction.  We  presented  dynamic  task  management  and  barrier  synchro¬ 
nization  algorithms  for  efficient  execution  of  programs  with  nested  parallel  loops.  The  issues 
in  implementation  of  the  atomic  lock  access  and  counter  increment  operations  that  are  used 
in  loop  self-scheduling,  task  management  and  barrier  synchronization  were  addressed  at  the 
level  of  hardware  synchronization  primitives. 

In  the  execution  of  parallel  FORTRAN  programs,  we  focused  on  loop  scheduling  overhead 
as  the  potential  cause  of  performance  degradation.  Loop  scheduling  overhead  was  shown  to 
determine  execution  time  for  fine  granularity  loops  and  to  vary  significantly  with  the  archi¬ 
tectural  synchronization  support.  The  synchronization  algorithms  used  in  executing  these 
programs  depend  heavily  on  shared  counters.  In  accessing  shared  counters,  we  concluded 
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that  lock  algorithms  which  reduce  bus  contention  do  enhance  performance.  For  the  class 
of  applications  we  looked  at,  due  to  the  importance  placed  on  shared  counters,  a  hardware 
implementation  of  a  fetc’n&add  primitive  in  shared  memory  can  be  as  effective  as  a  special 
synchronization  bus  which  handles  lock  accesses. 

We  observed  that  in  the  execution  of  parallel  FORTRAN  programs,  there  is  a  very 
low  locality  of  lock  accesses.  This  implies  that  implementing  atomic  operations  in  private 
caches  or  in  processors  rather  than  in  shared  memory  will  result  in  loss  of  performance  and 
additional  memory  traffic. 

The  simulation  results  with  real  program  traces  showed  that  while  for  an  application 
with  fine  granularity  loops  the  execution  time  showed  large  variance  across  synchronization 
primitives,  performance  of  an  application  with  coarse  granularity  is  less  sensitive  to  the 
particulars  of  hardware  synchronization  support.  The  simulation  results  with  real  parallel 
application  traces  showed  that  the  choice  of  the  hardware  synchronization  primitive  in  a 
shared-memory  multiprocessor  does  have  a  significant  effect  on  overall  program  performance. 

Our  assumptions  on  system  architecture  were  targeted  to  increase  the  throughput  of 
synchronization  operations  by  implementing  them  in  shared  memory,  and  to  decrease  the 
interaction  between  regular  memory  operations  and  synchronization  primitives  by  using  split 
phase  transactions.  On  architectures  with  longer  memory  access  latency  or  where  atomic 
operations  consume  more  shared  memory  bus  bandwith,  we  expect  to  see  a  more  severe 
performance  degredation  due  to  synchronization  overhead. 
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